Occurrence Based Categorical Data Clustering Using Cosine and Binary Matching Similarity Measure

Author

  • J. AKILANDESWARI
Abstract

Clustering is the process of grouping a set of physical objects into classes of similar objects. Real-world objects are described by both numerical and categorical data. Categorical data cannot be analyzed in the same way as numerical data because they lack an inherent ordering. This paper describes an occurrence based categorical data clustering (OBCDC) technique that uses the cosine similarity measure and the simple binary matching similarity measure. The OBCDC system consists of four modules: data pre-processing, similarity matrix generation, cluster formation, and validation. Similarity matrix generation uses three functions, namely FrequencyComputation, OccurranceBasedCosine and OccurranceBasedSBMS. The time complexity of the algorithms is discussed, and their performance on real-world data is measured using accuracy and error rate.
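The abstract does not give the definitions behind FrequencyComputation, OccurranceBasedCosine or OccurranceBasedSBMS, so the following is only a generic sketch of the two textbook measures it names, not the paper's method. It assumes each record is a tuple of category labels, takes simple matching as the fraction of attributes on which two records agree, and takes cosine similarity over the counts of how often each label occurs within a record. All function names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def simple_matching_similarity(a, b):
    """Fraction of attribute positions where the two records hold the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("records must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def occurrence_cosine_similarity(a, b):
    """Cosine similarity between label-occurrence count vectors.

    Assumption: a record is represented by how often each label occurs
    in it, ignoring attribute position. The paper may define this differently.
    """
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[label] * count_b[label] for label in count_a.keys() & count_b.keys())
    norm_a = sqrt(sum(v * v for v in count_a.values()))
    norm_b = sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(records, similarity):
    """Pairwise similarity matrix as a list of lists."""
    return [[similarity(r, s) for s in records] for r in records]


if __name__ == "__main__":
    # Hypothetical toy records with three categorical attributes each.
    records = [
        ("yes", "no", "yes"),
        ("yes", "yes", "no"),
        ("no", "no", "no"),
    ]
    for row in similarity_matrix(records, simple_matching_similarity):
        print(row)
    for row in similarity_matrix(records, occurrence_cosine_similarity):
        print(row)
```

One design point worth noting: if each attribute were instead one-hot encoded separately, every record would have the same number of ones, and cosine similarity would reduce to the simple matching value. The position-free count vectors above are one way to make the two measures differ.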

Similar resources

Presenting a Clustering Algorithm for Categorical Data by Combining Measures

Clustering is one of the main techniques in data mining. It is a process that partitions a data set into groups such that the data within a cluster are as close to each other as possible and the data in different clusters differ as much as possible. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...

Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching (or mismatching) measure. The weights of attribute values contribute much to clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency...

Computing Similarity Measure Based on Names of Goods for Fuzzy Clustering

Clustering is a method that can help in retrieving relevant information from databases [9, 1]. Businesses use databases to gather information about completed transactions. A large part of that information is in linguistic form or consists of categorical attributes, which do not have a natural order. Data in that form can be clustered by using a measure of similarity [6]. Most of earlier work focused on form c...

Binary-based similarity measures for categorical data and their application in Self-Organizing Maps

In exploratory data analysis of high dimensional data, one of the main tasks is the formation of a simplified overview of data sets. Clustering and projection are among the examples of useful methods to achieve this task. However there are several types of data where the use of this measure is not adequate, such as the categorical data. In this paper we will review some of the most common binar...

Document Retrieval using Hierarchical Agglomerative Clustering with Multi-view point Similarity Measure Based on Correlation: Performance Analysis

Clustering is one of the most interesting and important tools for research in data mining and other disciplines. The aim of clustering is to find the relationships among the data objects and classify them into meaningful subgroups. The effectiveness of clustering algorithms depends on the appropriateness of the similarity measure between the data in which the similarity can be computed. This pap...

Publication date: 2014